Police forces face the problem of limited resources and where to apply them. If they could gain insight into where higher crime rates are likely to occur, a more efficient approach could be achieved.
In this capstone we will look at a combination of crime data and venue data to see if there is a statistical/ML method of comparing neighbourhoods, and whether a particular cluster of venues implies a high crime rate.
We shall examine two areas, the town of Reading and the city of Oxford. Both are in the south of England and contain a large university. Oxford has a population of approximately 150,000; Reading has just over 210,000.
The ultimate aim, and indeed the measure of usefulness, of this study is to help law enforcement focus on particular crime types and apply resources more efficiently to a particular area. To keep the size of this study manageable only the following questions will be investigated.
Across a country, any area can be partitioned into neighbourhoods. For the purposes of this capstone we regard a neighbourhood as covering a few miles in width. For the two areas in consideration, Reading and Oxford, we shall center on the main shopping/retail/late-night entertainment region of each.
A neighbourhood will contain areas with differing crime levels. High crime is typically not spread evenly over an area but concentrated into so-called 'hot-spots'. If we can identify these hot-spots then we can look into whether the kinds of venues in the vicinity of a hot-spot - shops, bars, cinemas, etc. - indicate a likelihood of crime occurring there in future.
In this study there are two groups of data. One is the crime data with location details and the other is the venue information also with location details.
1. Location of crimes
Here we combine spatial data regarding police neighborhoods, given in JSON format of latitude longitude coordinates, with crime data that also has lat long coordinates.
JSON list of latitude, longitude coordinates defining the boundary of the neighborhood.
Firstly, get the list of police forces in England.
The end point does not require any parameters. The returned JSON is in the form of a list with each list element being "id":"<force id>", "name":"<force name>".
| Endpoint | Return Type | Example |
|---|---|---|
| https://data.police.uk/api/forces | JSON | [ {"id":"bedfordshire","name":"Bedfordshire Police"}, |
Then get the neighbourhoods for a police force (use the force id - my apologies to Star Wars fans). The endpoint requires the force id from the previous step, e.g. bedfordshire. The returned JSON is in the form of a list with each list element being "id":"<neighbourhood id>", "name":"<neighbourhood name>".
| Endpoint | Return Type | Example |
|---|---|---|
| https://data.police.uk/api/bedfordshire/neighbourhoods | JSON | [{"id":"BD1","name":"Ampthill,Flitwick,Silsoe"}, |
Finally, get the boundary in latitude, longitude pairs for that neighbourhood (use the neighbourhood id). The endpoint requires the force id from the first step, e.g. bedfordshire, and the neighbourhood id from the previous step, e.g. BD2. The returned JSON is in the form of a list with each list element being "latitude": latitude of boundary vertex, "longitude": longitude of boundary vertex.
| Endpoint | Return Type | Example |
|---|---|---|
| https://data.police.uk/api/bedfordshire/BD2/boundary | JSON | [{"latitude":"52.200075375","longitude":"-0.548369535"}, |
A police force is a collection of neighborhoods. The crime data shows the type of crime, time it took place and the latitude, longitude co-ordinates within that police force area. The crime data is split by month into separate csv files.
The data is not split into neighborhoods but is for the whole police force. The boundaries serve to give us a way to filter the data into more manageable data sizes.

It is also possible to query the police API for the same data within a polygon region and for a specific month.
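As a minimal sketch, building such a polygon request URL only requires joining colon-separated `lat,lon` pairs and appending a `YYYY-MM` date; the coordinates below are the illustrative ones from the API documentation, not a real boundary, and `build_crime_url` is a hypothetical helper name:

```python
# Build a crimes-street request URL from boundary points and a month.
# Points are (lat, lon) strings; the poly parameter is colon-separated pairs.
def build_crime_url(points, year, month):
    poly = ':'.join(f"{lat},{lon}" for lat, lon in points)
    return ("https://data.police.uk/api/crimes-street/all-crime"
            f"?poly={poly}&date={year:04d}-{month:02d}")

url = build_crime_url([("52.268", "0.543"), ("52.794", "0.238"), ("52.130", "0.478")], 2017, 1)
# → https://data.police.uk/api/crimes-street/all-crime?poly=52.268,0.543:52.794,0.238:52.130,0.478&date=2017-01
```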
import requests
import json
import pandas as pd
import pickle
import numpy as np
import folium
import math
import os
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from matplotlib.colors import rgb2hex
# import k-means from clustering stage
from sklearn.cluster import KMeans
The regions we are using are already known from examining the available UK police forces and corresponding areas within that police force.
'Abbey / Battle' is the central shopping and bar district in the town of Reading in Berkshire, UK.
'Oxford Central' is the equivalent district in the city of Oxford.
Both regions are similar in both size and type: they contain universities and a substantial number of shops, bars, etc.
Also, both are within the coverage of the Thames Valley Police Force.
# we are comparing 2 regions
regions = [
{ 'force': 'Thames Valley Police', 'hood': 'Abbey / Battle' },
{ 'force': 'Thames Valley Police', 'hood': 'Oxford Central' },
]
# Crime data ENDPOINTS
crime_endpoints = {
# note: this gets a list of all of the police forces in the UK
# no placeholder required
'force': "https://data.police.uk/api/forces",
# note: given a police force id this gets all of the neighbourhoods in that police force region
# {} is a placeholder for the police force id
# example: https://data.police.uk/api/thames-valley/neighbourhoods
'neighbourhood': "https://data.police.uk/api/{}/neighbourhoods",
# note: gets a list in latitude, longitude points defining the neighbourhood boundary shape
# {}/{} are placeholders for, respectively, the police force id and neighbourhood id
# example: https://data.police.uk/api/thames-valley/N464/boundary
'boundary': "https://data.police.uk/api/{}/{}/boundary",
# note:
# {} is set of lat lon pairs lat,lon:lat,lon:....
# {}-{} are placeholders for 4 digit year and 2 digit month, i.e. YYYY and MM
# example: https://data.police.uk/api/crimes-street/all-crime?poly=52.268,0.543:52.794,0.238:52.130,0.478&date=2017-01
'crime': "https://data.police.uk/api/crimes-street/all-crime?poly={}&date={}-{}",
# note: returns the valid set of crime categories as of a given date
# DOC: https://data.police.uk/docs/method/crime-categories/
# {}-{} are placeholders for 4 digit year and 2 digit month, i.e. YYYY and MM
# example: https://data.police.uk/api/crime-categories?date=2011-08
'categories':"https://data.police.uk/api/crime-categories?date={}-{}",
}
# CRIME DATA - 1. get the police force id
endpoint = crime_endpoints['force']
json_forces = requests.get(endpoint).json()
print("endpoint called:")
print(endpoint)
print()
print("top and tail forces from list")
print("=============================")
print(*json_forces[0:2], sep='\n')
print("...\n"*2,end='')
print(*json_forces[-2:], sep='\n')
for region in regions:
    force_name = region['force']  # e.g. 'Thames Valley Police'
    force = list(filter(lambda x: x['name'] == force_name, json_forces))[0]
    region['force_id'] = force['id']
print()
print(regions)
# CRIME DATA - 2. get the neighbourhood id
#hood_name = r'Abbey / Battle'
for region in regions:
    force_id = region['force_id']
    endpoint = crime_endpoints['neighbourhood'].format(force_id)
    json_hoods = requests.get(endpoint).json()
    print("endpoint called:")
    print(endpoint)
    print()
    print("top and tail neighbourhoods from list")
    print("=============================")
    print(*json_hoods[0:2], sep='\n')
    print("...\n"*2, end='')
    print(*json_hoods[-2:], sep='\n')
    hood_name = region['hood']
    hood = list(filter(lambda x: x['name'] == hood_name, json_hoods))[0]
    region['hood_id'] = hood['id']
print()
print(regions)
# Now for each hood boundary plus bounding box
for region in regions:
    force_id = region['force_id']
    hood_id = region['hood_id']
    endpoint = crime_endpoints['boundary'].format(force_id, hood_id)
    json_boundary = requests.get(endpoint).json()
    region['boundary'] = [[float(latlon['latitude']), float(latlon['longitude'])] for latlon in json_boundary]
    lat = [ll[0] for ll in region['boundary']]
    lon = [ll[1] for ll in region['boundary']]
    region['bbox'] = {
        'sw': [min(lat), min(lon)],
        'ne': [max(lat), max(lon)]
    }
def make_bbox(sw, ne):
    # corners in order: SW, SE, NE, NW (lat is the first coordinate)
    return [sw, [sw[0], ne[1]], ne, [ne[0], sw[1]]]
json_crimes = []
year = 2018
month = 6
nmonths = 9
for region in regions:
    year = 2018
    month = 6
    print(region['force'])
    print(region['hood'])
    region['crimes'] = []
    bbox = make_bbox(region['bbox']['sw'], region['bbox']['ne'])
    polys = ':'.join([f"{b[0]},{b[1]}" for b in bbox])
    for i in range(nmonths):
        syear = str(year)
        smonth = ("0"+str(month))[-2:]
        month += 1
        if month == 13:
            month = 1
            year += 1
        endpoint = crime_endpoints['crime'].format(polys, syear, smonth)
        region['crimes'].extend(requests.get(endpoint).json())
        print(f"{syear}-{smonth} {len(region['crimes'])}")
    # for each crime in region['crimes'] we need:
    #   crime['category']
    #   crime['location']['latitude'], crime['location']['longitude']
    dfcrimes = pd.DataFrame(data=region['crimes'])
    # also extract the latitude and longitude into their own columns
    dfcrimes['latitude'] = dfcrimes.apply(lambda x: float(x.location['latitude']), axis=1)
    dfcrimes['longitude'] = dfcrimes.apply(lambda x: float(x.location['longitude']), axis=1)
    region['dfcrimes'] = dfcrimes
for region in regions:
    print(f"{region['force']}:{region['hood']}")
    print("="*40)
    print()
    print(region['dfcrimes'].drop(columns=['location', 'outcome_status']).describe(include=['object', 'float', 'int']))
    print('\n'*2)
There are NaNs in both latitude and longitude.
Remove rows where there is a NaN in either or both and then describe again.
for region in regions:
    region['dfcrimes'].dropna(subset=['latitude', 'longitude'], inplace=True)
for region in regions:
    print(f"{region['force']}:{region['hood']}")
    print("="*40)
    print()
    print(region['dfcrimes'].drop(columns=['location', 'outcome_status']).describe(include=['object', 'float', 'int']))
    print('\n'*2)
Now there are no NaNs in the numerical columns, latitude and longitude.
Persist the results so far to a pickle file.
regions_filename = "regions.pkl"
with open(regions_filename, "wb") as write_file:
    pickle.dump(regions, write_file)
# get a global unique set of crime categories
crimetypes = set()
for region in regions:
    for crimetype in region['dfcrimes'].category.unique():
        crimetypes.add(crimetype)
colors_array = cm.rainbow(np.linspace(0, 1, len(crimetypes)))
type_colors = dict(zip(crimetypes, map(rgb2hex, colors_array)))
type_colors
regions[0].keys()
for region in regions:
    region['name'] = region['force'] + '-' + region['hood']
print("Constructing the maps of the regions")
map_fill_opacity = 0.7
maps = {}
basemaps = {}
for region in regions:
    area_centroid_lat = (region['bbox']['sw'][0] + region['bbox']['ne'][0]) / 2
    area_centroid_lng = (region['bbox']['sw'][1] + region['bbox']['ne'][1]) / 2
    # get a map centered on the region
    fmap = folium.Map(location=[area_centroid_lat, area_centroid_lng])
    basemaps[region['name']] = fmap
    # and zoom in so as to fit the boundary into the view
    fmap.fit_bounds(bounds=[region['bbox']['sw'], region['bbox']['ne']])
    points = list(map(tuple, region['boundary']))
    fg = folium.FeatureGroup(name='boundary')
    fg.add_child(
        folium.PolyLine(points, color="blue", weight=2.5, opacity=1)
    )
    fmap.add_child(fg)
    # set up one folium layer per crime type
    fgs = dict([(x, folium.FeatureGroup(name=x)) for x in crimetypes])
    for index, crime in region['dfcrimes'].iterrows():
        fg = fgs[crime['category']]
        lat = crime['latitude']
        lon = crime['longitude']
        c = type_colors[crime['category']]  # already a hex string
        label = crime['category']
        fg.add_child(
            folium.CircleMarker(
                [lat, lon],
                radius=1,
                popup=label,
                color=c,
                fill=True,
                fill_color=c,
                fill_opacity=map_fill_opacity
            )
        )
    for _, fg in fgs.items():
        fmap.add_child(fg)
    fmap.add_child(folium.LayerControl())
    print("adding ", region['name'])
    maps[region['name']] = fmap
print("..finished..")
print(regions[0]['name'])
maps[regions[0]['name']]
print(regions[1]['name'])
maps[regions[1]['name']]
Now that we have the crime data - i.e. the date, the type, the location - we can gridify the areas into equal-sized square cells.
Our method for calculating the crime stats is simply to start from the center of each cell and count the number of crimes within a certain radius. We use a search radius that is larger than the cell size. For this report we used 100 metres as the cell size (each side) and 300 metres as the search radius.
One issue with the data is that the locations are given in latitude and longitude, while we need the straight-line distance between the cell center and the crime. For this we transform the points using a combination of functions as described below:
function Haversine. from https://www.movable-type.co.uk/scripts/latlong.html (2019)
This uses the ‘haversine’ formula to calculate the great-circle distance between two points – that is, the shortest distance over the earth’s surface – giving an ‘as-the-crow-flies’ distance between the points (ignoring any hills they fly over, of course!). Haversine formula:
\begin{align}
a & = \sin^2(\Delta\phi/2) + \cos\phi_1 \cdot \cos\phi_2 \cdot \sin^2(\Delta\lambda/2)\\
c & = 2 \cdot \operatorname{atan2}(\sqrt{a}, \sqrt{1-a})\\
d & = R \cdot c
\end{align}
where $\phi$ is latitude, $\lambda$ is longitude, and $R$ is Earth's radius (mean radius = 6,371 km); note that angles need to be in radians to pass to the trig functions!
This is typically used to find the distance between a cell center ($\phi_1$, $\lambda_1$) and a crime ($\phi_2$, $\lambda_2$), with $\Delta\phi = \phi_1 - \phi_2$ and $\Delta\lambda = \lambda_1 - \lambda_2$.
function newpointfromdistanceandbearing. from https://www.movable-type.co.uk/scripts/latlong.html (2019)
Given a latitude, longitude point and a distance plus a bearing then return a new point:
\begin{align}
\phi_2 & = \operatorname{asin}(\sin\phi_1 \cdot \cos\delta + \cos\phi_1 \cdot \sin\delta \cdot \cos\theta)\\
\lambda_2 & = \lambda_1 + \operatorname{atan2}(\sin\theta \cdot \sin\delta \cdot \cos\phi_1, \cos\delta - \sin\phi_1 \cdot \sin\phi_2)
\end{align}
where $\phi$ is latitude, $\lambda$ is longitude, $\theta$ is the bearing (clockwise from north), and $\delta$ is the angular distance $d/R$; $d$ being the distance travelled, $R$ the Earth's radius.
def Haversine(lat1, lon1, lat2, lon2, **kwarg):
    """
    This uses the 'haversine' formula to calculate the great-circle distance between two points - that is,
    the shortest distance over the earth's surface - giving an 'as-the-crow-flies' distance between the points
    (ignoring any hills they fly over, of course!).
    Haversine formula:
        a = sin²(Δφ/2) + cos φ1 ⋅ cos φ2 ⋅ sin²(Δλ/2)
        c = 2 ⋅ atan2( √a, √(1−a) )
        d = R ⋅ c
    where φ is latitude, λ is longitude, R is earth's radius (mean radius = 6,371 km);
    note that angles need to be in radians to pass to trig functions!
    """
    R = 6371.0088  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(a**0.5, (1-a)**0.5)
    d = R * c
    return round(d, 4)
def newpointfromdistanceandbearing(lat, lon, distance, bearing):
    """
    φ2 = asin( sin φ1 ⋅ cos δ + cos φ1 ⋅ sin δ ⋅ cos θ )
    λ2 = λ1 + atan2( sin θ ⋅ sin δ ⋅ cos φ1, cos δ − sin φ1 ⋅ sin φ2 )
    where φ is latitude, λ is longitude, θ is the bearing (clockwise from north),
    δ is the angular distance d/R; d being the distance travelled, R the earth's radius
    """
    R = 6371.0088  # mean Earth radius in km
    angdist = distance / R
    lat1, lon1, bearing = map(np.radians, [lat, lon, bearing])
    latnew = math.asin(np.sin(lat1)*np.cos(angdist) +
                       np.cos(lat1)*np.sin(angdist)*np.cos(bearing))
    lonnew = lon1 + math.atan2(math.sin(bearing)*math.sin(angdist)*math.cos(lat1),
                               math.cos(angdist) - math.sin(lat1)*math.sin(latnew))
    lat2, lon2 = map(np.degrees, [latnew, lonnew])
    return lat2, lon2
def get_crimes(df, cell, search_radius):
    # pre-filter to the bounding square, then keep points within the circular radius
    d1 = df[(df.x >= cell[0]-search_radius) &
            (df.x <= cell[0]+search_radius) &
            (df.y >= cell[1]-search_radius) &
            (df.y <= cell[1]+search_radius)]
    return d1[d1.apply(lambda row: distance(row.x-cell[0], row.y-cell[1]) <= search_radius, axis=1)]
cell_size = 100.0
search_radius = 3*cell_size
L2_distance = lambda x,y : math.sqrt(x*x + y*y)
distance = L2_distance
for region in regions:
    bbox = region['bbox']
    bbox_origin = bbox['sw']
    df_ll = region['dfcrimes'][['latitude', 'longitude']].copy()
    # x/y are metres north/east of the bounding-box origin
    df_ll['x'] = df_ll.apply(lambda x: 1000.0*Haversine(bbox_origin[0], bbox_origin[1], x.latitude, bbox_origin[1]), axis=1)
    df_ll['y'] = df_ll.apply(lambda x: 1000.0*Haversine(bbox_origin[0], bbox_origin[1], bbox_origin[0], x.longitude), axis=1)
    nlat = int(round(1000*Haversine(bbox['sw'][0], bbox['sw'][1], bbox['ne'][0], bbox['sw'][1])/cell_size, 0))
    nlon = int(round(1000*Haversine(bbox['sw'][0], bbox['sw'][1], bbox['sw'][0], bbox['ne'][1])/cell_size, 0))
    dlat = newpointfromdistanceandbearing(bbox['sw'][0], bbox['sw'][1], cell_size/1000, 0)[0] - bbox['sw'][0]
    dlon = newpointfromdistanceandbearing(bbox['sw'][0], bbox['sw'][1], cell_size/1000, 90)[1] - bbox['sw'][1]
    cell_xy = [((ix+0.5)*cell_size, (iy+0.5)*cell_size, ix, iy,
                bbox['sw'][0] + (ix+0.5)*dlat, bbox['sw'][1] + (iy+0.5)*dlon)
               for ix in range(nlat) for iy in range(nlon)]
    region['crimes'] = [(cell[2], cell[3], cell[4], cell[5], len(get_crimes(df_ll, cell, search_radius)))
                        for cell in cell_xy]
    region['crimes_info'] = [dlat, dlon, nlat, nlon]
Here we take the area defined by the bounding box, plot the crime count per cell as a heatmap, and keep only the cells whose count exceeds a minimum threshold.
%matplotlib inline
import seaborn as sns; sns.set()
min_crimes = 900
fig, axes = plt.subplots(1, 2, figsize=(25, 10))
nax = 0
for region in regions:
    nx = max([c[0] for c in region['crimes']])
    ny = max([c[1] for c in region['crimes']])
    cc1 = [c[4] for c in region['crimes']]
    data = np.reshape(cc1, (nx+1, ny+1))
    # zero out cells below the hotspot threshold
    data = (data > min_crimes) * data
    ax = sns.heatmap(data, ax=axes[nax])
    ax.set_title(region['name'])
    nax += 1
    ax.invert_yaxis()
    im = ax.collections[0]
    region['rgba_values'] = im.cmap(im.norm(im.get_array()))
hotspot_maps = {}
for region in regions:
    crimes = pd.DataFrame(region['crimes'], columns=['ix', 'iy', 'lat', 'lon', 'stat'])
    min_crimes = 900
    region['hotspots'] = crimes[crimes.stat > min_crimes]
    region['hotspots_rgbas'] = region['rgba_values'][crimes.stat > min_crimes]
    dlat = region['crimes_info'][0]
    dlon = region['crimes_info'][1]
    area_centroid_lat = (region['bbox']['sw'][0] + region['bbox']['ne'][0]) / 2
    area_centroid_lng = (region['bbox']['sw'][1] + region['bbox']['ne'][1]) / 2
    # get a map centered on the region
    hotspot_maps[region['name']] = folium.Map(location=[area_centroid_lat, area_centroid_lng])
    hotspot_maps[region['name']].fit_bounds(bounds=[region['bbox']['sw'], region['bbox']['ne']])
    for (_, hotspot), rgba in zip(region['hotspots'].iterrows(), region['hotspots_rgbas']):
        folium.Rectangle(
            [[hotspot.lat - dlat/2.0, hotspot.lon - dlon/2.0],
             [hotspot.lat + dlat/2.0, hotspot.lon + dlon/2.0]],
            popup=folium.Popup(f"{int(hotspot.stat)}"),
            color=rgb2hex(rgba),
            weight=2,
            fill_opacity=0.6,
            opacity=0.4,
            fill=True,
        ).add_to(hotspot_maps[region['name']])
regions[0]['hotspots'].head()
print(regions[0]['name'])
hotspot_maps[regions[0]['name']]
print(regions[1]['name'])
hotspot_maps[regions[1]['name']]
with open("regions2.pkl", "wb") as f:
    pickle.dump(regions, f)
Now we have a collection of crime hotspots for 2 regions we can get the venues near to those hotspots.
A coordinate can be supplied to the Foursquare API to obtain the venues within a specified radius. Each venue belongs to particular category and also, to a specific level of category. Out of the information per venue we only require the venue category.
The endpoint details are as follows. The client id and client secret are specific to the user and require a new user to sign up to Foursquare. ll is latitude longitude separated by a comma. Version has been set here to "20180604" - see the Foursquare API site for more information. MYRADIUS will be set to the distance from the centroid of the cell from which to gather the venue information, typically set to a multiple of the cell size, e.g. cell size = 100 metres, MYRADIUS = 500 metres. limit can go up to 50 (per the latest version of the docs), we will use 50 in our study.
Foursquare API : https://developer.foursquare.com/docs
A list of categories is also supplied by the Foursquare API.
These are at specified levels. For example, if one restaurant has the category "Japanese Restaurant" and another "Asian Restaurant", then both fall under the same general category of Food (Level 1) and Asian Restaurant (Level 2), but the first venue has been given a name that is deeper in the hierarchy (Japanese Restaurant is Level 3). We 'normalise' each venue category into levels. So for the two restaurants just mentioned: at Level 1, both are Food; at Level 2, both are Asian Restaurant; at Level 3, one is Japanese Restaurant and the other is Not-categorized (my terminology). Note that the level numbers I give here and use in the study are my own invention - Foursquare does not label their data in this way.
Level 1 : Food
├── Level 2 : Afghan Restaurant
├── Level 2 : African Restaurant
│   └── Level 3 : Ethiopian Restaurant
├── ...
├── Level 2 : Asian Restaurant
│   ├── ...
│   ├── Level 3 : Japanese Restaurant
│   ├── ...
│   ├── Level 3 : Thai Restaurant
│   │   └── Level 4 : Som Tum Restaurant
│   └── ...
├── ...
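This level numbering can be reproduced with a short recursive walk over the nested category list. The structure below is a hand-made toy, not the real Foursquare JSON, but the real response nests child categories under a 'categories' key in the same way; levels are rooted at 1 to match the levels passed to `GetCategory` later in the notebook.

```python
# Toy nested structure mirroring how Foursquare nests child categories.
toy_cats = [
    {"name": "Food", "categories": [
        {"name": "Asian Restaurant", "categories": [
            {"name": "Japanese Restaurant", "categories": []},
        ]},
    ]},
]

def assign_levels(cats, level=1, out=None):
    # Depth-first walk: record (name, level), then recurse into the
    # children with level + 1 for each step down the hierarchy.
    if out is None:
        out = []
    for cat in cats:
        out.append((cat["name"], level))
        assign_levels(cat["categories"], level + 1, out)
    return out

levels = assign_levels(toy_cats)
# → [('Food', 1), ('Asian Restaurant', 2), ('Japanese Restaurant', 3)]
```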
Page describing the Foursquare categories
https://developer.foursquare.com/docs/resources/categories
Page describing the API call
https://developer.foursquare.com/docs/api/venues/categories
API endpoint to receive JSON list of categories
# @hidden_cell
CLIENT_ID = 'VTEV5MX35CKCG2R50PIE4NTJT5TMKM0WNHBXGUSBCKA4W12Y' # your Foursquare ID
CLIENT_SECRET = 'BGIWCG2AV3CEBDTFOE355G4OGEEBYIDKBNMAYV0CHMVVDUN2' # your Foursquare Secret
VERSION = '20190606' # accept API versions up to this date
LIMIT = 100
class FourSquare:
    foursq = None
    cats = None

    def Load_Categories(self, filename):
        # e.g. filename = 'foursq_categories.json'
        with open(filename) as json_file:
            foursq = json.load(json_file)
        self.foursq = foursq['response']['categories']

    def AddCategory(self, df, category_name, n, level, parent_idx, icon_url):
        df.loc[n] = [n, category_name, level, parent_idx, icon_url]

    def Traverse(self, df, cats, n, L, p):
        # depth-first walk over the nested categories, assigning levels
        for cat in cats:
            n += 1
            icon_url = cat['icon']['prefix'] + "<SIZE>" + cat['icon']['suffix']
            self.AddCategory(df, cat['name'], n, L+1, p, icon_url)
            n = self.Traverse(df, cat['categories'], n, L+1, n)
        return n

    def Build_Categories(self):
        columns = ['Index', 'Category', 'Level', 'Parent_idx', 'Icon_URL']
        df = pd.DataFrame(columns=columns)
        df = df.astype(dtype={"Index": "int64",
                              "Category": "object",
                              "Level": "int64",
                              "Parent_idx": "int64",
                              "Icon_URL": "object"})
        self.Traverse(df, self.foursq, 0, 0, 0)
        self.cats = df

    def GetCategory(self, cat_tree, category, level):
        # walk up the parent links until the requested level is reached
        row = cat_tree[cat_tree.Category == category]
        this_level = row.iloc[0]['Level']
        if this_level > level:
            for i in range(this_level - level):
                row = cat_tree.loc[cat_tree['Index'] == row.iloc[0]['Parent_idx']]
        return row.iloc[0]['Category']
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name, end=':')
        print('\r', end='')
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    nearby_venues.drop_duplicates(subset=['Neighborhood', 'Venue',
                                          'Venue Latitude',
                                          'Venue Longitude',
                                          'Venue Category'], inplace=True)
    print()
    return nearby_venues
LIMIT=100
RADIUS = 300
for region in regions:
    print(f"{region['name']}")
    names = [e[0] for e in enumerate(region['hotspots'].index)]
    latitudes = [e for e in region['hotspots'].lat]
    longitudes = [e for e in region['hotspots'].lon]
    region['venues'] = getNearbyVenues(names=names,
                                       latitudes=latitudes,
                                       longitudes=longitudes,
                                       radius=RADIUS,
                                       limit=LIMIT)
    print()
print("..finished")
for region in regions:
    print(f"{region['force']}:{region['hood']}")
    venues = region['venues']
    print(venues.shape)
    print(venues.sample(4))
    print()
    print('There are {} unique categories in this region.'.format(len(venues['Venue Category'].unique())))
    print("="*40)
foursq_file = 'foursq.pkl'
fq = FourSquare()
if os.path.isfile(foursq_file):
    with open(foursq_file, "rb") as f:
        fq = pickle.load(f)
else:
    fq.Load_Categories('foursq_categories.json')
    fq.Build_Categories()
print(f"There are {len(fq.cats)} categories")
fq.cats.sample(5)
with open(foursq_file, "wb") as f:
    pickle.dump(fq, f)
# get the level categories for the venues
MAX_LEVEL = 4
for region in regions:
    venues = region['venues']
    region['grouped'] = {}
    print(region['name'])
    for level in range(1, MAX_LEVEL+1):
        print(f"level={level}")
        level_cats = venues['Venue Category'].apply(lambda x: fq.GetCategory(fq.cats, x, level))
        region_onehot = pd.get_dummies(level_cats, prefix="", prefix_sep="")
        # add the neighborhood column back to the dataframe
        region_onehot['Neighborhood'] = venues['Neighborhood']
        # move the neighborhood column to the first position
        fixed_columns = [region_onehot.columns[-1]] + list(region_onehot.columns[:-1])
        region_onehot = region_onehot[fixed_columns]
        region_grouped = region_onehot.groupby('Neighborhood').mean().reset_index()
        region['grouped'][level] = region_grouped
The methodology is the main component of this report: the exploratory data analysis, the statistical processing of the crime data into hotspot cells, and the machine learning used (K-means clustering of venue-category frequencies) have all been described above.
Here all of the crimes within the search radius were included. Each crime belongs to a particular crime category. The full list is as follows:
from: https://data.police.uk/api/crime-categories?date=2019-12
| url | name |
|---|---|
| anti-social-behaviour | Anti-social behaviour |
| bicycle-theft | Bicycle theft |
| burglary | Burglary |
| criminal-damage-arson | Criminal damage and arson |
| drugs | Drugs |
| other-theft | Other theft |
| possession-of-weapons | Possession of weapons |
| public-order | Public order |
| robbery | Robbery |
| shoplifting | Shoplifting |
| theft-from-the-person | Theft from the person |
| vehicle-crime | Vehicle crime |
| violent-crime | Violence and sexual offences |
| other-crime | Other crime |
Some crime categories could be removed from the study to reflect the particular questions being asked. If one were analysing theft, one might include only 'Bicycle theft', 'Other theft', 'Shoplifting' and 'Theft from the person' and exclude the rest. This can be achieved by filtering the crime data on 'category'.
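As a minimal sketch of that filter (the rows here are illustrative; in the notebook the dataframe is `region['dfcrimes']` and the category values are the URL-style names from the table above):

```python
import pandas as pd

# Illustrative crime records; only the 'category' column matters for the filter.
dfcrimes = pd.DataFrame({
    "category": ["burglary", "bicycle-theft", "drugs", "shoplifting"],
    "latitude": [51.455, 51.456, 51.454, 51.453],
})

# Keep only theft-related categories.
theft_categories = ["bicycle-theft", "other-theft", "shoplifting", "theft-from-the-person"]
df_theft = dfcrimes[dfcrimes["category"].isin(theft_categories)]
# → only the bicycle-theft and shoplifting rows remain
```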
Example
| | Cell | Cell Latitude | Cell Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 124 | 2 | 51.454018 | -0.972134 | Market Place | 51.455566 | -0.969469 | Plaza |
| 129 | 2 | 51.454018 | -0.972134 | Greggs | 51.455459 | -0.969436 | Bakery |
| 80 | 2 | 51.454018 | -0.972134 | Côte Brasserie | 51.453729 | -0.969215 | French Restaurant |
| 126 | 2 | 51.454018 | -0.972134 | Royal Tandoori | 51.454898 | -0.968959 | Indian Restaurant |
| 103 | 2 | 51.454018 | -0.972134 | The Botanist | 51.455107 | -0.969595 | English Restaurant |
For instance, 'Plaza' in the list of Four Square categories is under 'Outdoors & Recreation'.
Outdoors & Recreation
|-- Plaza
Outdoors & Recreation is assigned a level 1 designation as it is the topmost layer; Plaza is then assigned level 2.
1 Outdoors & Recreation
|-- 2 Plaza
1 Food
|-- 2 Bakery
|-- 2 French Restaurant
| |-- 3 Burgundian Restaurant
|-- 2 Indian Restaurant
| |-- 3 Andhra Restaurant
| |-- 3 Awadhi Restaurant
|-- 2 English Restaurant
The one-hot encoding calculates the frequencies for those venue categories.
Given the following venues and level 1 we would have the following one-hot table (these are truncated for display purposes).
Venues for cell
Plaza, Bakery, Bakery, French Restaurant, Burgundian Restaurant, English Restaurant
One-Hot Table Level 1
| Arts & Entertainment | College & University | Events | Food | Nightlife Spot | Outdoors & Recreation | Professional & Other Places | Residence | Shop & Service | Travel & Transport |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.8333 | 0.0 | 0.1666 | 0.0 | 0.0 | 0.0 | 0.0 |
One-Hot Table Level 2
| Amphitheater | Aquarium | ... | Plaza | Bakery | French Restaurant | English Restaurant |
|---|---|---|---|---|---|---|
| 0.0 | 0.0 | ... | 0.1666 | 0.3333 | 0.3333 | 0.1666 |
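Both tables can be reproduced with `pd.get_dummies` followed by a mean over the rows, mirroring the notebook's grouping step. A minimal sketch using the six example venues above (at Level 2 the Burgundian Restaurant is normalised up to French Restaurant):

```python
import pandas as pd

# Level 1 categories of the six example venues.
level1 = pd.Series(["Outdoors & Recreation", "Food", "Food", "Food", "Food", "Food"])
freq1 = pd.get_dummies(level1).mean()
# Food = 5/6 ≈ 0.8333, Outdoors & Recreation = 1/6 ≈ 0.1666

# Level 2 categories (Burgundian Restaurant rolls up to French Restaurant).
level2 = pd.Series(["Plaza", "Bakery", "Bakery", "French Restaurant",
                    "French Restaurant", "English Restaurant"])
freq2 = pd.get_dummies(level2).mean()
# Bakery = French Restaurant = 1/3; Plaza = English Restaurant = 1/6
```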
for level in range(1, MAX_LEVEL+1):
    region_grouped_clustering = region['grouped'][level].drop(columns='Neighborhood')
    Sum_of_squared_distances = []
    K = range(1, 15)
    for kclusters in K:
        kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(region_grouped_clustering)
        Sum_of_squared_distances.append(kmeans.inertia_)
# Optimal K example
K = np.array([1,2,3,4,5,6,7,8,9,10])
inertia = np.array([0.38, 0.15, 0.10, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01])
plt.plot( K, inertia, 'bx-', label='inertia')
plt.plot( K[1:], -np.diff(inertia), label='1st difference')
plt.plot( K[1:-1], -np.diff(-np.diff(inertia)), label='2nd difference')
_ = plt.legend()
We take the first difference (the delta) and then take the difference of the deltas (the gamma or convexity).
In the above the greatest 2nd difference is for $K = 2$. This becomes our optimal number of clusters, 'optimal K'.
Then apply that particular K-means to obtain the cluster corresponding to the cells.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
def kmeans_optimal_elbow(criterion):
    # criterion: inertia values for K = 1..n
    n = len(criterion)
    delta1 = -np.diff(criterion)   # first difference (the delta)
    delta2 = -np.diff(delta1)      # second difference (the convexity)
    # pad so that index i corresponds to K = i+1
    delta1 = np.concatenate(([0, 0], delta1), axis=0)[-n:]
    delta2 = np.concatenate(([0, 0], delta2), axis=0)[-n:]
    k = np.array([k+1 for k in range(n)])
    max_k = delta2.argmax() - 1
    return k[max_k]
# ## 4. Cluster Neighbourhoods
num_top_venues = 5
MAX_LEVEL = 4
for region in regions:
    region_name = f"{region['force']}:{region['hood']}"
    region_data = region['hotspots'][['lat', 'lon']].reset_index()
    region_data['Neighborhood'] = region_data.index
    region['clusters'] = {}
    region['kmeans'] = {}
    for level in range(1, MAX_LEVEL+1):
        region_grouped_clustering = region['grouped'][level].drop(columns='Neighborhood')
        Sum_of_squared_distances = []
        # the number of samples must be greater than or equal to the number of clusters
        K = range(1, min(15, len(region_grouped_clustering)))
        for kclusters in K:
            kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(region_grouped_clustering)
            Sum_of_squared_distances.append(kmeans.inertia_)
        criterion = Sum_of_squared_distances
        optimal_k = kmeans_optimal_elbow(criterion)
        print(f"level {level} optimal_k={optimal_k}")
        get_ipython().run_line_magic('matplotlib', 'inline')
        plt.plot(K, criterion, 'bx-')
        plt.xlabel('k')
        plt.ylabel('Sum_of_squared_distances')
        plt.title(f"Elbow Method For Optimal k:{region_name}, Level={level}")
        plt.axvline(x=optimal_k, ymin=0.0, ymax=1, linestyle='--')
        plt.show()
        # fit the final clustering with the optimal number of clusters
        kmeans = KMeans(n_clusters=optimal_k, random_state=0).fit(region_grouped_clustering)
        region['kmeans'][level] = kmeans
        indicators = ['st', 'nd', 'rd']
        # create columns according to the number of top venues
        columns = ['Neighborhood']
        for ind in np.arange(num_top_venues):
            try:
                columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
            except IndexError:
                columns.append('{}th Most Common Venue'.format(ind+1))
        # create a new dataframe listing each neighborhood's most common venues
        neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
        neighborhoods_venues_sorted['Neighborhood'] = region['grouped'][level]['Neighborhood']
        for ind in np.arange(region['grouped'][level].shape[0]):
            res = return_most_common_venues(region['grouped'][level].iloc[ind, :], num_top_venues)
            neighborhoods_venues_sorted.iloc[ind, 1:] = res
        if 'Cluster Labels' in neighborhoods_venues_sorted.columns:
            neighborhoods_venues_sorted.drop(['Cluster Labels'], inplace=True, axis=1)
        neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
        # merge with region_data to add latitude/longitude for each neighborhood
        region_merged = region_data.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
        region['clusters'][level] = region_merged
pickle.dump(regions, open("regions3.pkl","wb"))
from IPython.display import display, Markdown, Latex

def ShowResults(region_idx, level):
    txt = regions[region_idx]['clusters'][level] \
        .drop(['Neighborhood', 'lat', 'lon', 'index'], axis=1)
    txt = txt \
        .drop_duplicates() \
        .astype({'Cluster Labels': 'int32'}) \
        .sort_values(by=['Cluster Labels', '1st Most Common Venue', '2nd Most Common Venue',
                         '3rd Most Common Venue', '4th Most Common Venue', '5th Most Common Venue']) \
        .to_html(index=False) \
        .replace("\n", "")
    display(Markdown(f"#### **Region {region_idx}, Level {level}**"))
    display(Markdown(txt))
_ = ShowResults(0,1)
_ = ShowResults(1,1)
This is an interesting set of results. For both regions the primary influence is Food venues, with Shop, Travel and Nightlife venues also having a large influence.
By moving to the level 2 venue categories we may gain an insight into which kinds of Food, Shop etc. venues are involved.
_ = ShowResults(0,2)
_ = ShowResults(1,2)
For both regions Bars look to be the primary influence, with Coffee Shops and Asian Restaurants important but secondary.
_ = ShowResults(0,3)
_ = ShowResults(1,3)
At level 3 Coffee Shop and Pub are the primary influences.
_ = ShowResults(0,4)
_ = ShowResults(1,4)
At level 4 Coffee Shop and Pub are the primary influences.
# ## 5. Discussion
While Food related venues are the primary influence at level 1, at the higher levels they lose out to more specific venue types such as Coffee Shops and Bars.
The one-hot encoding looks at the relative frequencies of venues in the area surrounding a cell, but this does not take the absolute number of venues into account. One cell may have 20 venues nearby and another only 5, yet their relative frequencies could be identical; to this algorithm there would be no difference between them.
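To illustrate this limitation with two hypothetical cells (the venue counts below are invented): once counts are normalised to relative frequencies, as the one-hot grouping does, cells with very different venue densities become indistinguishable.

```python
import pandas as pd

# hypothetical venue counts near two cells
cell_a = pd.Series({'Bar': 16, 'Coffee Shop': 4})  # 20 venues nearby
cell_b = pd.Series({'Bar': 4, 'Coffee Shop': 1})   # only 5 venues nearby

# normalise to relative frequencies, as the one-hot grouping step does
freq_a = cell_a / cell_a.sum()
freq_b = cell_b / cell_b.sum()

print(freq_a.equals(freq_b))  # → True: K-means sees no difference
```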
The higher levels (3 and 4) do not give much more information, since the venue categories across both levels are almost the same. This is due to the original categorisation by FourSquare, but it is not necessarily a problem: stopping at level 3 makes the investigation quicker.
We have only covered the venue types for crime hotspots. It would make an interesting project to see whether non-hotspot information could also be included to predict crime hotspots.
To determine crime hotspots we used a simple total number of crimes as the statistic. Another method is to use an area-based relative statistic such as Getis-Ord, which compares each cell's local total against the distribution of totals across the whole area, flagging cells whose neighbourhood sum is significantly high.
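A minimal sketch of the Getis-Ord Gi* statistic, under simplifying assumptions: per-cell crime totals are held in a flat array, and the spatial weights are a binary matrix marking which cells fall within each cell's search radius (including the cell itself). The data below is invented; a real analysis would use a library such as PySAL.

```python
import numpy as np

def getis_ord_gi_star(counts, weights):
    """Gi* sketch: counts is a 1-D array of per-cell crime totals; weights is
    an (n, n) binary matrix with weights[i, j] = 1 if cell j lies within the
    search radius of cell i (self included)."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    x_bar = counts.mean()
    s = np.sqrt((counts ** 2).mean() - x_bar ** 2)
    w_sum = weights.sum(axis=1)
    num = weights @ counts - x_bar * w_sum
    den = s * np.sqrt((n * (weights ** 2).sum(axis=1) - w_sum ** 2) / (n - 1))
    return num / den  # z-score per cell; large positive => hot-spot

# hypothetical 1-D strip of cells with a crime spike in the middle
counts = np.array([1, 1, 1, 9, 10, 9, 1, 1, 1])
n = len(counts)
# neighbourhood: self plus immediate left/right cell
weights = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1).astype(float)
z = getis_ord_gi_star(counts, weights)
print(z.argmax())  # the central cell is flagged as the hot-spot
```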
We have used the crimes across all of the crime categories but perhaps there is a subset that could be more useful for crime hotspot prediction.
We used a cell size of 100 metres with a search radius of 300 metres for both the crime and venue searches. What about other values, such as 50 m or 100 m?
A naive but effective way to determine crime hotspots would be to merely look at the number of venues near to a cell.
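A sketch of that naive baseline, assuming hypothetical arrays of cell centres and venue locations in planar metre coordinates (names, coordinates, and the radius default are illustrative only):

```python
import numpy as np

def venues_within_radius(cells, venues, radius=300.0):
    """For each cell centre, count the venues within `radius` metres
    (coordinates assumed planar, in metres)."""
    # pairwise distances between every cell and every venue
    d = np.linalg.norm(cells[:, None, :] - venues[None, :, :], axis=2)
    return (d <= radius).sum(axis=1)

# hypothetical data: three cells along a line, five venues
cells = np.array([[0.0, 0.0], [500.0, 0.0], [1000.0, 0.0]])
venues = np.array([[50.0, 50.0], [100.0, 0.0], [250.0, 100.0],
                   [600.0, 50.0], [950.0, 0.0]])
print(venues_within_radius(cells, venues))
# → [3 2 1]
```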
An extension of the naive method just described could be to take the general kind of venue into account. A large number of Art Galleries is unlikely to influence the crime rate and so could be excluded. However, in reality those galleries would be accompanied by support businesses that intend to capture some of the footfall.
There is a lot of useful data collected for this project even if the kmeans approach did not yield definitive answers.